MODELS AND ESTIMATION PROCEDURES FOR THE
ANALYSIS OF SUBJECTS-BY-ITEMS DATA ARRAYS:


THE  FINAL  REPORT 


Overview 

The problems suggested by the title of the project that is the subject of this report are of intrinsic interest from the point of view of mathematical statistics and scaling theory. Unfortunately, they turn out not to be as central to the problems of model-based psychological measurement as I originally thought. Thus, roughly midway through the project I started to focus on a new problem which I feel is more central to the concerns of this program and deserves much more attention than it has received so far. In this report, I will summarize the progress made on the original problems, describe some of the unresolved difficulties, and explain why it is important for the advancement of model-based psychological measurement to address the new problem mentioned above.

The problems originally considered concern estimation of parameters in Tukey's (1949) model for data in a two-way array. This generalization of the additive model is interesting for two reasons. First, data arrays which are nonadditive, but can be rendered additive by a monotonic transformation, often fit this model very well. Second, the model has some of the same advantages as the additive model for purposes of scaling row and column effects (subject ability and item difficulty effects, in the present application). Row means are independent of the column parameters and vice versa, so row and column means can provide a suitable basis for scaling the row and column effects. For these reasons, Tukey's model is worth considering as an alternative to Rasch models as a basis for scaling unidimensional latent traits.

If one adopts Tukey's model as the basis for testing, there are two ways one might go about measuring subject and item effects, assuming the original data are nonadditive. One might seek a transformation that will render the data additive. To the extent that such a transformation brings the array more into line with the usual ANOVA model assumptions, there may be advantages to carrying out estimation in terms of the transformed data and then transforming the results back into the original metric. On the other hand, it is sometimes undesirable or useless to transform the original data. For example, transformation of dichotomous test data is useless. As has been mentioned, row and column means of the original data provide a reasonable basis for measurement of row and column effects, without using a transformation. The work of this project has resulted in two contributions that should help in either of these two approaches.

If Tukey's model holds and if it is possible to render the data additive by a monotonic transformation, then the required transformation is a simple one implicit in the model. It is unique up to a linear transformation. Let $y$ represent data, let $\mu$ be the grand mean of the array and $\lambda$ the nonadditivity parameter in the model, and let $c = \mu - \lambda^{-1}$. The data cannot be rendered additive by any monotonic transformation unless either $E(y) > c$ for all data points, or $E(y) < c$ for all data points. If either condition is satisfied, then the required transformation to attain additivity is $\log(y - c)$ or $\log(c - y)$, depending on which condition holds. It is remarkable that this result has not been noted before, considering that Tukey's model is so well known. Tukey's model and results concerning it are summarized in more detail in the next section. A complete account is given in Technical Report 80-1.

The problem of estimation of subject and item effects arises whether or not one transforms the data first. The usual least squares estimators are unbiased and consistent, but they can be improved upon using an empirical Bayes approach. Griffin and Krutchkoff (1971) have shown that the optimal linear estimators of marginal effects are the sample effects multiplied by a constant. The constant is equal to the ratio of the variance of the underlying parameter to the sum of that variance and the conditional variance of the sample effect, given the parameter. A very important feature of their result for our purposes is that it does not depend on parametric assumptions concerning the distribution of entries in the array. It applies equally well to dichotomous, right-wrong responses and to response latencies, for example. I have shown how this approach can be applied to the problem of estimation of several proportions. Jackson (1972) suggested a method similar to the one I propose, but Novick, Lewis, and Jackson (1973) found some difficulties with his suggestion. These difficulties led them to propose a more complicated approach. A key problem in applying either the Griffin and Krutchkoff approach or Jackson's approach to this problem is estimation of the variance of the underlying parameter. I propose a weighted estimator of this variance and have shown that, in the context of estimation of several proportions, it avoids the serious problems noted by Novick, Lewis, and Jackson. These developments are described in detail in Technical Report 81-1, which will be summarized below, in the third section of this report.

The research just described assumes one is dealing with unidimensional traits. There is no good reason to assume a priori that a unidimensional latent trait is capable of representing the possible states of learning of subjects with regard to a given conceptual domain. Unidimensional latent trait models are postulated on the basis of mathematical convenience. Unfortunately, there are numerous situations in achievement testing where data show that unidimensionality does not hold.

Failure of unidimensionality can have serious practical consequences in adaptive mastery testing, one of the areas where it has been hoped that latent trait theory might prove most beneficial. If a given unidimensional model holds, it is possible to determine the status of subjects to a given level of precision with substantially fewer items than a standard test would require. This is accomplished by taking the subject's responses to the initial items on a test into account in selecting subsequent items. However, if the model on which item selection is based is wrong, then the adaptive procedure may introduce biases and other inaccuracies into the assessment of the subject which make it worse than a test constructed on traditional principles, rather than better.

Since the success of adaptive testing may well hinge on the quality of the models it is based on, a thorough exploration of alternative models to represent the state of subjects in different cognitive domains should be a first priority for future research. Such work should draw on recent developments in cognitive psychology, but it will need to go considerably beyond them. As Susan Whitely (1980) has pointed out, these recent developments have been preponderantly concerned with chronometric studies of the performance of subjects who are competent at the tasks they are asked to do. Dealing with the testing problem will require models that represent the state of subjects who are still learning the tasks in question and can account for the subjects' patterns of correct and incorrect responses. Multicomponent latent trait models represent one approach to the problem which is worth pursuing. An even more promising approach, in my opinion, is to use a finite, discrete state latent space to represent the state of the subject. The all-or-none models of concept learning popular in the 1960's are the sort of thing I have in mind, but one must go considerably beyond those models to represent the state of subjects with respect to concepts one would be interested in testing in practical situations. Much of my time in the latter part of this project was spent developing a proposal outlining how these ideas might be used in testing mastery of signed-number arithmetic. Since this approach is now being explored on a new contract, detailed discussion of it will be deferred to subsequent reports.

Tukey's model and its properties


In Tukey's two-way ANOVA model with a single degree of freedom for nonadditivity, the expected value of the score of subject $i$ on item $j$ is

$$\mu_{ij} = \mu + \alpha_i + \beta_j + \lambda\alpha_i\beta_j. \tag{1}$$

The linear transformation obtained by multiplying both sides of Equation 1 by $\lambda$ and adding $1 - \lambda\mu$ yields a multiplicative form of the model:

$$\lambda\mu_{ij} - \lambda\mu + 1 = 1 + \lambda\alpha_i + \lambda\beta_j + \lambda^2\alpha_i\beta_j = (1 + \lambda\alpha_i)(1 + \lambda\beta_j). \tag{2}$$

If $1 + \lambda\alpha_i > 0$ for all $i$ and $1 + \lambda\beta_j > 0$ for all $j$, then we have the additive representation

$$\log(\lambda\mu_{ij} - \lambda\mu + 1) = \log(1 + \lambda\alpha_i) + \log(1 + \lambda\beta_j). \tag{3}$$


A theorem of Luce and Tukey (1964) implies that this additive representation is unique up to a linear transformation, when it exists. Equation 3 implies that the transformation $\log(\lambda y - \lambda\mu + 1)$ should yield additivity for an array of data satisfying Tukey's model, provided one can take logarithms on both sides of Equation 2. Obvious linear transformations yield more convenient, equivalent forms of the transformation: $\log(y - \mu + \lambda^{-1})$ if $\lambda$ is positive, $\log(\mu - \lambda^{-1} - y)$ if $\lambda$ is negative. It is interesting and somewhat puzzling that Tukey never mentions this transformation explicitly in connection with his model. He does mention transformations of the form $\log(y - c)$ in passing, but not the important point that $c = \mu - \lambda^{-1}$. The role of $c$ in these transformations is analogous to the role of the exponent in power transformations, so the fact that $c$ is determined by the model parameters is significant.
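
As an illustration of this point, the following minimal Python sketch builds a noise-free array of expected scores under Tukey's model and checks that $\log(y - c)$ with $c = \mu - \lambda^{-1}$ removes the interaction. The parameter values, the function names, and the sum-to-zero constraint on the effects are assumptions made for the example; they do not come from the report.

```python
import numpy as np

def tukey_expected(mu, alpha, beta, lam):
    """Expected scores mu + a_i + b_j + lam * a_i * b_j (Tukey's model)."""
    a = np.asarray(alpha, dtype=float)[:, None]
    b = np.asarray(beta, dtype=float)[None, :]
    return mu + a + b + lam * a * b

def interaction_residuals(table):
    """Residuals left after removing the grand mean, row effects and column effects."""
    t = np.asarray(table, dtype=float)
    return (t - t.mean(axis=1, keepdims=True)
              - t.mean(axis=0, keepdims=True) + t.mean())

# Illustrative values only; effects are constrained to sum to zero.
mu, lam = 10.0, 0.2
alpha = np.array([-2.0, 0.0, 2.0])
beta = np.array([-1.0, -0.5, 1.5])

expected = tukey_expected(mu, alpha, beta, lam)
c = mu - 1.0 / lam                  # constant determined by the model parameters
transformed = np.log(expected - c)  # lam > 0 here, so every expected score exceeds c

raw_max = float(np.abs(interaction_residuals(expected)).max())
log_max = float(np.abs(interaction_residuals(transformed)).max())
print(raw_max, log_max)
```

On the raw scale the residuals equal the interaction terms $\lambda\alpha_i\beta_j$; on the transformed scale they vanish up to rounding error.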

The number $\mu - \lambda^{-1}$ has a geometric interpretation which shows an interesting aspect of Tukey's model. Notice that for a given subject $i$, the regression of the expected score on the $\beta_j$'s is linear, with slope $1 + \lambda\alpha_i$. If we omit the subscript on $\beta$ in Equation 1 to indicate that it is variable and write $\mu_i(\beta)$ in place of $\mu_{ij}$, we obtain

$$\mu_i(\beta) = \mu + \alpha_i + (1 + \lambda\alpha_i)\beta. \tag{4}$$

Solving for the point of intersection of any two distinct regression lines given by Equation 4 yields $\beta = -\lambda^{-1}$, at which point the expected score is $\mu_i(\beta) = \mu - \lambda^{-1}$. Thus, in graphical terms, Tukey's model posits linear regression lines for each subject which converge to a common focal value. The expected score at that value is $\mu - \lambda^{-1}$.
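
The convergence of the subject regression lines can be checked numerically. The sketch below is a hedged illustration: the helper names and the parameter values are hypothetical, and each line is written as an intercept $\mu + \alpha_i$ and a slope $1 + \lambda\alpha_i$.

```python
def regression_line(mu, alpha_i, lam):
    """Intercept and slope of one subject's expected score as a function of beta."""
    return mu + alpha_i, 1.0 + lam * alpha_i

def intersection(line_a, line_b):
    """Point (beta, score) at which two non-parallel lines meet."""
    (a0, a1), (b0, b1) = line_a, line_b
    beta = (b0 - a0) / (a1 - b1)
    return beta, a0 + a1 * beta

# Illustrative values only.
mu, lam = 10.0, 0.2
alphas = [-2.0, 0.5, 3.0]
lines = [regression_line(mu, a, lam) for a in alphas]

# Every pair of distinct lines should meet at beta = -1/lam with score mu - 1/lam.
points = [intersection(lines[i], lines[j])
          for i in range(len(lines)) for j in range(i + 1, len(lines))]
print(points)
```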

It  is  easy  to  verify  the  following  properties  of  deviations  from  the 
focal  value: 

1. The overall mean deviation is $\lambda^{-1}$.

2. The ratio of the mean deviation for subject $i$ to the overall mean deviation is $1 + \lambda\alpha_i$.

3. The ratio of the mean deviation for item $j$ to the overall mean deviation is $1 + \lambda\beta_j$.

4. The expected deviation from the focal value for subject $i$ on item $j$ is equal to the product of the mean deviation for subject $i$ and the mean deviation for item $j$, divided by the overall mean deviation.
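
The four properties can be verified on a small noise-free array. This is only a sketch: the parameter values are made up, and the effects are assumed to sum to zero over subjects and over items.

```python
import numpy as np

# Illustrative values only; alpha and beta each sum to zero.
mu, lam = 10.0, 0.2
alpha = np.array([-2.0, 0.0, 2.0])
beta = np.array([-1.0, -0.5, 1.5])
expected = mu + alpha[:, None] + beta[None, :] + lam * np.outer(alpha, beta)

focal = mu - 1.0 / lam            # focal value of the model
dev = expected - focal            # deviations from the focal value
overall = dev.mean()              # property 1: equals 1 / lam
row = dev.mean(axis=1)            # mean deviation for each subject
col = dev.mean(axis=0)            # mean deviation for each item

print(overall)                    # property 1
print(row / overall)              # property 2: 1 + lam * alpha_i
print(col / overall)              # property 3: 1 + lam * beta_j
print(np.allclose(dev, np.outer(row, col) / overall))  # property 4
```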

If $\mu = \lambda^{-1}$, the focal value is zero and the remarks above hold for the scores themselves. The last remark can then be strengthened to say that the expected score for subject $i$ on item $j$ is equal to the product of the mean score for subject $i$ and the mean score on item $j$, divided by the overall mean score. This is the multiplicative property which is the foundation of Rasch's (1959) Poisson and gamma models. Rasch's model for analysis of dichotomous items has the same multiplicative structure if we regard the dependent variable to be the odds in favor of correct response. Thus, Tukey's model is seen to be a direct generalization of the Rasch models.

The generalization of Rasch models represented by Tukey's model costs less in loss of desirable properties than one might expect. One obtains in return a model with a much larger range of applicability. It is true that the property of separability of item and subject parameters which Rasch has emphasized does not hold in the strict sense employed by Rasch. Item parameters are separable from subject parameters in Rasch's sense if the conditional probability of a subject making a given response to an item, given the subject's total score on all items, depends only on the item parameters, not on the subject parameter. This is not the case in the general model, but the item parameters are nevertheless separable from the subject parameters in a weaker sense that is useful. The deviation of the mean score on an item from the overall mean depends only on the item parameter, not on the subject parameters. In fact, the usual least squares estimators of subject and item effects for the two-way ANOVA model are unbiased and consistent.

It is easy to verify that item and subject parameters are not separable in Rasch's sense, even if we restrict ourselves to an additive model with standard normal distribution assumptions. This is the best behaved model that one can assume in terms of how well it lends itself to estimation of parameters. The fact that its item and subject parameters are not separable in Rasch's sense suggests that Rasch's sense of separability is an unnecessarily restrictive requirement to place on models for subjects-by-items arrays.

Some interesting properties of the Rasch models do carry over to the more general model if one assumes, as Rasch did, that the conditional distribution of responses, given the subject and item parameters, is a Poisson or a gamma distribution. Due to the special reproductive properties of these distributions, the marginal sums over subjects or items are distributed as Poisson or gamma variates; i.e., the form of the conditional distribution carries over to the marginal distribution. Furthermore, the parameters of the distribution of the marginal sums for each student depend only on the student parameters and the parameters of the distributions of the marginal sums for each item depend only on the item parameters. Thus, subject and item parameters are separable in Tukey's model in senses which are important for the problem of parameter estimation. We turn now to this problem.

Estimation of the parameters in Tukey's model

It has already been noted that deviations of subject and item marginal means from the grand mean are consistent, unbiased estimators of subject and item effects in Tukey's model. Recent work on empirical Bayes estimation procedures suggests ways to improve upon the least squares estimators. Previous work has shown that the usual estimator of the nonadditivity parameter in Tukey's model can be severely biased. Because the usual estimator depends on estimates of the marginal effects, it seemed that the corrections to the least squares estimators of marginal effects which the empirical Bayes estimators make might lead to a corresponding correction in the bias of the nonadditivity parameter estimator. This project has explored these questions in detail. The study of the empirical Bayes estimators of subject and item effects has led to some useful results, but carryover of improvement to the estimator of $\lambda$ has not worked out as hoped.

Estimation of subject and item effects. Let $X$ be a random variable with expected value $m$. Suppose we have a random sample of $n$ different $X$'s from a population in which the mean $m$ is not constant, but has a distribution with mean $\mu_m$ and variance $\sigma_m^2$. Let $\sigma_X^2$ be the unconditional variance of $X$. In the present context, $X$ might be a marginal mean for a given subject, $m$ the true mean for the subject, $\sigma_m^2$ the true variance in $m$ among subjects, and $\sigma_X^2$ the variance of the sample marginal means for subjects.

The question arises about how best to estimate the separate means using the values of the separate $X$'s obtained. One approach would be to use each $X$ to estimate the mean in its subpopulation. If the $X$'s are subject means, for example, we would use the subject means themselves to estimate the subjects' true scores. On the other hand, if the sample variance of the $X$'s observed is close to what one would expect in sampling from a population with a single, constant value of $m$, it might be better to use the grand mean of the $X$'s as a common estimate of $m$ for all the means. Griffin and Krutchkoff (1971) have shown that if one restricts oneself to estimators which are linear functions of the $X$'s, the best estimators in the sense of minimizing overall squared error are given by the following compromise between the possibilities just mentioned:

$$\hat{m}(X) = CX + (1 - C)\mu_m = \mu_m + C(X - \mu_m), \tag{5}$$

where $C = \sigma_m^2 / \sigma_X^2$. The possibilities mentioned above correspond to letting $C = 0$ if $\sigma_m^2 = 0$, and $C = 1$ if $\sigma_m^2 = \sigma_X^2$.
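
The compromise in Equation 5 amounts to shrinking each observed mean toward the grand mean by the factor $C$. A minimal sketch follows, assuming the variances are known; the function name and the numerical values are hypothetical.

```python
import numpy as np

def optimal_linear_estimates(x, mu_m, var_m, var_x):
    """Shrink observed means x toward mu_m by C = var_m / var_x."""
    c = var_m / var_x
    x = np.asarray(x, dtype=float)
    return mu_m + c * (x - mu_m), c

# Illustrative values: three observed subject means.
x = np.array([0.2, 0.5, 0.9])
est, c = optimal_linear_estimates(x, mu_m=0.5, var_m=0.03, var_x=0.04)
print(c, est)

# The two extreme cases: C = 0 returns the grand mean, C = 1 returns x itself.
pooled, _ = optimal_linear_estimates(x, mu_m=0.5, var_m=0.0, var_x=0.04)
unpooled, _ = optimal_linear_estimates(x, mu_m=0.5, var_m=0.04, var_x=0.04)
```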

Implementation of the optimal linear estimators requires knowledge of $\mu_m$ and $\sigma_m^2$, which we usually lack. The sample grand mean is a good estimator of $\mu_m$, but the estimation of $\sigma_m^2$ can be something of a problem. The standard estimator of $\sigma_m^2$ from the analysis of variance components is probably satisfactory for present purposes, but it cannot be claimed that the estimators of marginal effects which result from using these estimates of $\mu_m$ and $\sigma_m^2$ are optimal. They do seem to be an improvement on the least squares estimators.


Estimation of the nonadditivity parameter. Let $y_{ij}$ be the score of subject $i$ on item $j$ and let $\hat\alpha_i$ and $\hat\beta_j$ be the least squares estimators of the subject and item effects for subject $i$ and item $j$, respectively. The usual estimator of the nonadditivity parameter $\lambda$ is given by


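$$\hat\lambda = \frac{\sum_{i,j} \hat\alpha_i \hat\beta_j y_{ij}}{\sum_{i,j} \hat\alpha_i^2 \hat\beta_j^2}. \tag{6}$$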


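The expected value of the numerator of the right side of Equation 6 is $\lambda \sum_{i,j} \alpha_i^2 \beta_j^2$. The denominator is an estimator of $\sum_{i,j} \alpha_i^2 \beta_j^2$, so $\hat\lambda$ is a plausible estimator of $\lambda$. Unfortunately, while $\hat\alpha_i$ and $\hat\beta_j$ are unbiased estimators of $\alpha_i$ and $\beta_j$, $\sum_i \hat\alpha_i^2$ and $\sum_j \hat\beta_j^2$ tend to overestimate the respective sums of squared effects. Their expectations involve $\sigma_e^2$ as well as $\sigma_\alpha^2$ and $\sigma_\beta^2$. As a result, $\hat\lambda$ tends to underestimate $\lambda$ in absolute value. The bias can be substantial in realistic settings. The median of $\hat\lambda$ is approximately

$$\operatorname{median} \hat\lambda \approx C_\alpha C_\beta \lambda, \tag{7}$$

where

$$C_\alpha = \frac{J\sigma_\alpha^2}{\sigma_e^2 + J\sigma_\alpha^2} \quad \text{and} \quad C_\beta = \frac{n\sigma_\beta^2}{\sigma_e^2 + n\sigma_\beta^2}. \tag{8}$$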


Note that $C_\alpha$ and $C_\beta$ are the correction factors by which the optimal linear estimators of Griffin and Krutchkoff would correct the least squares estimators of subject and item effects, respectively.

Equation 7 suggests that we might improve on $\hat\lambda$ by multiplying by $C_\alpha^{-1} C_\beta^{-1}$. This would undoubtedly help if we knew $C_\alpha$ and $C_\beta$, but we do not know them in most situations, so we must estimate them. We did simulation studies which show that multiplying $\hat\lambda$ by analysis-of-variance components estimators of $C_\alpha^{-1}$ and $C_\beta^{-1}$ does reduce bias, but at a disastrous cost in variability. The resulting estimator has substantially greater squared error than $\hat\lambda$. A good estimator of $\lambda$ is yet to be found.
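
The usual estimator and the attenuation factors can be sketched as follows. The sketch assumes the standard one-degree-of-freedom form of Tukey's estimator, and it assumes that $J$ and $n$ in the correction factors denote the numbers of items and subjects; the function names and numerical values are hypothetical. On noise-free data the estimator recovers $\lambda$ exactly, so the bias discussed above arises only from error in the estimated effects.

```python
import numpy as np

def tukey_lambda_hat(y):
    """Usual nonadditivity estimate from a subjects-by-items array y."""
    y = np.asarray(y, dtype=float)
    grand = y.mean()
    a_hat = y.mean(axis=1) - grand      # least squares subject effects
    b_hat = y.mean(axis=0) - grand      # least squares item effects
    numerator = a_hat @ y @ b_hat       # sum over i, j of a_i * b_j * y_ij
    denominator = (a_hat ** 2).sum() * (b_hat ** 2).sum()
    return numerator / denominator

def attenuation(n_subjects, n_items, var_e, var_alpha, var_beta):
    """Approximate factor C_alpha * C_beta by which the median is shrunk."""
    c_alpha = n_items * var_alpha / (var_e + n_items * var_alpha)
    c_beta = n_subjects * var_beta / (var_e + n_subjects * var_beta)
    return c_alpha * c_beta

# Noise-free check with illustrative parameters (effects sum to zero).
mu, lam = 10.0, 0.2
alpha = np.array([-2.0, 0.0, 2.0])
beta = np.array([-1.0, -0.5, 1.5])
y = mu + alpha[:, None] + beta[None, :] + lam * np.outer(alpha, beta)
lam_hat = tukey_lambda_hat(y)

# Hypothetical variance components for 50 subjects and 20 items.
factor = attenuation(50, 20, var_e=1.0, var_alpha=0.25, var_beta=0.1)
print(lam_hat, factor)
```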

It is important to put the difficulties in the estimation of $\lambda$ in proper perspective. The usual estimator is consistent, which is about as much as can be claimed for the available estimators of parameters in competing logistic models. The fact that the estimation of $\lambda$ poses some problems should not in itself lead one to dismiss the Tukey model as an attractive alternative to Rasch models. For one thing, the difficulties do not affect the estimation of the parameters of most concern, the subject and item parameters. On the other hand, the uncertainty regarding $\lambda$ would affect the application of Tukey's model in adaptive testing.

Suppose one is trying to estimate $\alpha_i$ for some subject on the basis of the subject's performance on a small subset of the items. Let $\bar{y}_i$ be the subject's average score on the subset of items and let $\bar\beta$ be the average item effect for items in the subset. Then the method of moments estimator of $\alpha_i$ is

$$\hat\alpha_i = \frac{\bar{y}_i - \mu - \bar\beta}{1 + \lambda\bar\beta}. \tag{9}$$

Clearly, the value of $\lambda$ would affect judgements about $\alpha_i$. If all items were employed, $\bar\beta$ would be zero and $\hat\alpha_i$ would not be affected by $\lambda$, but one of the main points of adaptive testing is to tailor the items to either the ability of the subject or the ability level adopted as a cutoff, depending on the testing context. In either case, $\bar\beta$ is unlikely to be zero. The method of moments estimator is probably not the best we can do in this situation, but whatever approach is employed would be affected by $\lambda$, so it would be desirable to have a better estimator of this parameter.
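
The sensitivity of the method of moments estimator to $\lambda$ can be seen in a small worked example. The function name and the numbers below are hypothetical; the subject's true effect is taken to be 2 and the true $\lambda$ to be 0.2.

```python
def alpha_moment_estimate(mean_score, mu, mean_beta, lam):
    """Method of moments estimate of a subject effect from a subset of items."""
    return (mean_score - mu - mean_beta) / (1.0 + lam * mean_beta)

# Expected mean score on a subset whose average item effect is 1.5.
mu, true_alpha, true_lam, mean_beta = 10.0, 2.0, 0.2, 1.5
mean_score = mu + true_alpha + mean_beta + true_lam * true_alpha * mean_beta

correct = alpha_moment_estimate(mean_score, mu, mean_beta, lam=0.2)
halved = alpha_moment_estimate(mean_score, mu, mean_beta, lam=0.1)
ignored = alpha_moment_estimate(mean_score, mu, mean_beta, lam=0.0)
print(correct, halved, ignored)
```

With the correct $\lambda$ the estimate returns the true effect; understating $\lambda$, as the usual estimator tends to do, inflates the estimate in this example.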

Empirical Bayes estimation of proportions in several groups. The optimal linear estimators of Griffin and Krutchkoff can be applied to the problem of simultaneous estimation of proportions in several groups. In order to do so it is necessary to estimate the variance of the underlying proportions. Griffin and Krutchkoff do not deal with the unbalanced case, where the observed proportions are based on different sample sizes. In Technical Report 81-1, I propose the following estimator for $\sigma_m^2$, the between-group variance component in the unbalanced one-way random effects ANOVA model:

$$\hat\sigma_m^2 = (J - 1)\Bigl(N - N^{-1}\sum_j n_j^2\Bigr)^{-1}\bigl(MS_{\text{between}} - MS_{\text{within}}\bigr). \tag{10}$$

Here $J$ is the number of groups, $n_j$ the number of observations in group $j$, and $N$ the total number of observations. This estimator is an unbiased, consistent estimator of $\sigma_m^2$ even if the usual assumptions of normality and homoscedasticity fail to hold. When Equation 10 is specialized to the case of binomial proportions, the formulas for the mean sums of squares can be simplified to expressions involving the numbers of successes in each group, the $r_j$'s, and the total number of successes, $R$. The estimator of the variance of the underlying proportions becomes




of data gathered in educational contexts. In several of the applications, they found the approach less than satisfactory. The main problem seems to be the sampling variability of the estimator of the variance of the underlying parameter. For example, a negative estimate of variance occurs in one of the applications. Novick et al. argue that the difficulties with estimation of the between-group variance component show the need for a more rigorously Bayesian approach in which prior beliefs about the extent of between-group variation can be integrated with sample information.

When Griffin and Krutchkoff's method is applied to the data of Novick et al., with weighted estimates of $\sigma_m^2$, the problems encountered by Jackson's approach do not arise. The resulting estimates are in line with some of the Bayesian solutions considered by Novick et al. This suggests that the use of unweighted estimators might be the root of the problem with Jackson's approach, rather than the failure to incorporate prior beliefs into the estimates.
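
A between-group variance estimate of the kind used here can be sketched as follows. The sketch assumes the standard moment estimator for the unbalanced one-way random effects layout, with the between-group mean square weighted by group size; whether this agrees in every detail with the estimator of Technical Report 81-1 is an assumption, and the function name and data are hypothetical.

```python
import numpy as np

def between_group_variance(groups):
    """Moment estimate (MS_between - MS_within) / n0 for unequal group sizes."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    J = len(groups)
    n = np.array([g.size for g in groups], dtype=float)
    N = n.sum()
    means = np.array([g.mean() for g in groups])
    grand = sum(g.sum() for g in groups) / N
    ms_between = (n * (means - grand) ** 2).sum() / (J - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - J)
    n0 = (N - (n ** 2).sum() / N) / (J - 1)   # effective group size
    return (ms_between - ms_within) / n0

# Balanced toy data: the estimate reduces to (MS_between - MS_within) / n.
balanced = between_group_variance([[0, 2], [4, 6]])

# Unbalanced right-wrong data: the estimate can come out negative.
unbalanced = between_group_variance([[0, 1], [1, 1, 1, 0]])
print(balanced, unbalanced)
```

The second call illustrates how sampling variability can produce a negative estimate of a variance, the kind of difficulty noted in the applications discussed above.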

Let $g_j$ denote the root-arcsine transformation of the observed proportion of successes in group $j$:


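Let $\gamma$ be the transformed value of a proportion $p$. Jackson estimates the mean and variance of the distribution of $\gamma$ by

$$\hat\mu_\gamma = J^{-1}\sum_j g_j \tag{16}$$

and

$$\hat\sigma_\gamma^2 = (J - 1)^{-1}\sum_j (g_j - \hat\mu_\gamma)^2 - J^{-1}\sum_j (4n_j + 2)^{-1}, \tag{17}$$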
respectively. The conditional variance of $g_j$ is $(4n_j + 2)^{-1}$, so $\hat\sigma_\gamma^2$ is the difference between the sample variance of the $g_j$'s and the average of their conditional variances. The following estimators of $\mu_\gamma$ and $\sigma_\gamma^2$ weight the contributions from each group inversely to the conditional variance of $g_j$:


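$$g_\cdot = (4N + 2J)^{-1}\sum_j (4n_j + 2)\,g_j, \tag{18}$$

$$\hat\sigma_\gamma^2 = \Bigl[4N + 2J - \frac{\sum_j (4n_j + 2)^2}{4N + 2J}\Bigr]^{-1}\Bigl[\sum_j (4n_j + 2)(g_j - g_\cdot)^2 - (J - 1)\Bigr].$$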


When Jackson's method is applied to the data of Novick et al. using the weighted estimators in Equation 18 instead of the unweighted estimators in Equations 16 and 17, the results are practically identical to the Griffin and Krutchkoff estimates. On the face of it, weighted estimators would seem to be preferable to unweighted estimators, but under some circumstances one would be too hasty in drawing this conclusion. Both weighted and unweighted estimators can yield negative estimates of the between-group variance component and Tukey (1957) has shown that there are situations in the unbalanced case where unweighted estimators are better than weighted.
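
The weighted estimators can be sketched as follows, assuming weights $4n_j + 2$ (the inverse conditional variances of the transformed proportions) and a moment-matching form for the variance estimate. The function name and the example values are hypothetical, and the variance formula is a reconstruction rather than a quotation from Technical Report 81-1.

```python
import numpy as np

def weighted_mean_and_variance(g, n):
    """Inverse-variance weighted mean of g and a moment estimate of its true variance."""
    g = np.asarray(g, dtype=float)
    n = np.asarray(n, dtype=float)
    w = 4.0 * n + 2.0                       # inverse conditional variances
    total = w.sum()                         # equals 4N + 2J
    g_dot = (w * g).sum() / total           # weighted mean
    q = (w * (g - g_dot) ** 2).sum()        # weighted sum of squared deviations
    var_hat = (q - (g.size - 1)) / (total - (w ** 2).sum() / total)
    return g_dot, var_hat

# Two groups with transformed proportions 0.5 and 0.9 and sample sizes 2 and 7.
g_dot, var_hat = weighted_mean_and_variance([0.5, 0.9], [2, 7])
print(g_dot, var_hat)
```

Like the unweighted version, this estimate is a difference of two quantities and so can be negative when the observed spread among groups is smaller than sampling error alone would produce.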

According to Tukey (1957), two of the most important factors determining whether weighted or unweighted estimators will yield more accurate estimates of the between-group variance component are the size of true between-group variance and the variability of sample size from group to group. Other things being equal, large true variation between groups favors unweighted estimators over weighted estimators, whereas large variability in sample size favors weighted estimators. In Monte Carlo simulations described in Technical Report 81-1, under conditions mimicking those in the examples analyzed by Novick et al., we found that the differences in performance between weighted and unweighted estimators of the between-group variance are not significant when the true variation between groups is large and the variability in sample size is moderate; that is, under conditions that favor unweighted estimators. On the other hand, the differences in favor of weighted estimators are substantial when the sample sizes are extremely variable and the between-group variance component is small. It is therefore recommended that weighted estimators be used as a general practice when applying either the Griffin and Krutchkoff or the Jackson procedure.

From a philosophical point of view, some might prefer the rigorously Bayesian approach of Novick et al. to the empirical Bayes approaches of Griffin and Krutchkoff or Jackson. However, in applying the rigorously Bayesian approach there are problems of computational complexity and technical problems concerning specification of prior distributions which might lead even a convinced Bayesian to prefer the empirical Bayes approaches. These problems are discussed in detail in Technical Report 81-1. The work on this project shows that the use of weighted estimators of the between-group variance component answers the most serious objections of Novick et al. to the empirical Bayes approach.

Conclusions  and  recommendations 

The usefulness of Tukey's model for model-based psychological testing is probably greatest for analyses of responses which are not dichotomous, such as response latencies. There are not many models for such responses to choose from, other than the additive ANOVA model and Rasch's multiplicative models. Tukey's model is a generalization of these models which costs very little in terms of loss of their desirable properties. Nondichotomous variables have not traditionally been the focus of much attention in test development. However, recent efforts, such as Sternberg (1977) and Thissen (1979), to relate the chronometric methods of analysis frequently employed in the study of cognition to psychometrics, suggest that the analysis of response latencies will be of increasing importance in the future. The results of this project should be helpful in these analyses.

When the response variable is dichotomous, Tukey's model may or may not be competitive with Rasch models and other logistic models. Whether it is or not undoubtedly depends on the specific test in question. Tukey's model is most likely to be adequate in testing situations with the following characteristics:

1. The items are unidimensional.

2. The Rasch model is not appropriate for some reason, such as unequally discriminating items.

3. The number of subjects is too small to adequately estimate the parameters in the more flexible two- or three-parameter logistic models.

Under these circumstances, Lord (1979) has shown that the use of Rasch methods is preferable to methods based on the two- or three-parameter logistic models, even when a two- or three-parameter logistic model generates the data. Although the studies comparing the performance of methods based on Tukey's model with the performance of the Rasch methods have not been done, it seems plausible that Tukey's model would often provide a better approximation than the Rasch methods when the conditions listed above hold.

Unfortunately, these conditions define an uncomfortably narrow range of
applicability for Tukey's model, narrower than is perhaps apparent on first
reading. It is not hard to find situations satisfying the second and third
conditions. The Rasch model is often too simple to fit the data, and Lord's
results suggest that it is impractical to use more complex logistic models
unless there are at least two or three hundred subjects available for item
calibration. The unidimensionality condition is the hard one to satisfy. How
hard it is likely to be is not sufficiently appreciated yet by workers in the
field.

A study by Tatsuoka and Birenbaum (1979) of the performance of prealgebra
students on a test of signed-number arithmetic illustrates a problem which makes
the achievement of unidimensionality problematic. The students all took
computer-assisted drills on signed numbers, but their instruction in class varied
from teacher to teacher. If their test performance could be adequately represented
by a unidimensional latent trait model, then item characteristic curves based
on different samples of subjects ought to coincide. Tatsuoka and Birenbaum
found dramatic differences between the item characteristic curves for different
classes, differences which they could relate plausibly to specific item types
and differences in the approaches taken by the various classroom teachers in
presenting the signed-number concept.
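
The logic of this check can be sketched in code. The example below is a rough
illustration only and is not the procedure used by Tatsuoka and Birenbaum: it
assumes 0/1 scored responses, uses the total test score as a crude stand-in for
the latent trait, and the function names `empirical_icc` and `max_icc_gap` are
hypothetical.

```python
from collections import defaultdict


def empirical_icc(responses, item):
    """Proportion correct on `item` at each total-score level.

    `responses` is a list of 0/1 lists, one per subject. The total score
    is an assumed proxy for the latent trait, not an estimate of it.
    """
    counts = defaultdict(lambda: [0, 0])  # total score -> [n correct, n subjects]
    for row in responses:
        total = sum(row)
        counts[total][0] += row[item]
        counts[total][1] += 1
    return {score: c / n for score, (c, n) in sorted(counts.items())}


def max_icc_gap(class_a, class_b, item):
    """Largest difference between two classes' empirical curves for `item`,
    taken over the total-score levels that occur in both classes."""
    icc_a = empirical_icc(class_a, item)
    icc_b = empirical_icc(class_b, item)
    shared = set(icc_a) & set(icc_b)
    return max((abs(icc_a[s] - icc_b[s]) for s in shared), default=0.0)


# Made-up data: two classes, three items, four subjects each.
class_a = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
class_b = [[0, 1, 0], [0, 0, 1], [0, 1, 1], [1, 1, 1]]
gap = max_icc_gap(class_a, class_b, item=0)
```

If a single latent trait accounted for the responses, the gap for each item
should be small apart from sampling error. In the made-up data above, subjects
with the same total score differ sharply between classes on the first item,
which is the kind of discrepancy described in the text.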

Failures of unidimensionality can often be attributed to the inclusion of
items in a test which clearly tap different skills. This argument does not
apply in the present case. The test of signed-number arithmetic was very carefully
designed to represent just about as precisely and narrowly defined a
domain as one could have and still have a testing situation of genuine applied
interest. Upon reflection, it should not be too surprising to find that
unidimensionality is difficult to achieve with such tests. Most achievement tests
are intended to assess knowledge of material from a specified domain, without
reference to how the material was taught. It is reasonable to suppose that the
relative difficulty of test items depends on how the concepts which the items
are designed to test were taught and learned. Due to variation in instruction,
which will usually occur to some extent, even in contexts where student progress
is monitored automatically, examinees will differ with respect to how much the
material to be tested is studied and how the study time is used. As a result
we should often expect to find the variation from class to class in item
characteristic curves which Tatsuoka and Birenbaum found with signed-number
arithmetic. Unidimensionality is likely to be an exceptional occurrence in achievement
tests of narrowly defined domains.

Violation of the unidimensionality assumption can have serious consequences
for adaptive tests of concept mastery. Any adaptive procedure based on a
unidimensional latent trait model would result in tests composed of decidedly
nonrandom samples of items from the item pool. If the model is incorrect, then
the procedure is quite likely to be inappropriate and the resulting measurements
misleading.

When I undertook this project, I tacitly assumed that many mastery testing
situations would satisfy the conditions under which Tukey's model would be likely
to be of most use. I was not thinking enough about the unidimensionality
assumption which Tukey's model shares with most of the other extant latent trait models.
The areas where unidimensional representations are inadequate are areas of
considerable importance in military training and education generally. Many critical
specialized training modules involve concepts analogous to signed-number
arithmetic. There must be better ways to conceive of mastery of these concepts than
to think of it in terms of a subject's position on a continuous latent trait.
Finding adequate representations of the performance of subjects on tests of these
concepts should have at least equal priority with further development of
unidimensional models.

A good clue about a possible direction to pursue in seeking simple models
of concept mastery is implicit in the observations of Tatsuoka and Birenbaum
and in closely related work of Brown and Burton (1978). Careful analyses by
these investigators of the patterns of responses to items on tests of simple
mathematical concepts show that subjects who have not mastered a concept often
fall into distinctive patterns of errors, patterns which derive from systematic
misconceptions. For example, on signed-number addition problems, some students
get the sign right on all the problems, but always add the absolute values of
the addends to get the absolute values of the sums, rather than subtracting the
absolute values when the signs of the addends disagree. If a significant fraction
of the subjects have this misconception, that in itself leads to a violation of
unidimensionality. The fact that relatively few systematic misconceptions can
account for a majority of the response patterns of subjects who have not mastered
a concept suggests an alternative model to serve as a basis for testing mastery
of the concept. Instead of characterizing subjects by positions on a numerical
continuum, one can characterize them as belonging to latent states. The latent
states correspond to the systematic misconceptions, or to mastery of the concept.
The development of models along these lines is now the goal of my research in
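
The latent-state idea can be made concrete with a small sketch. What follows is
an illustration only, not a model developed in this report. The item set, the
state names, and the match-counting rule used to assign a subject to a state
are all assumptions, and the sketch ignores careless errors, which any usable
model would have to allow for.

```python
def correct_answer(a, b):
    """Mastery state: ordinary signed-number addition."""
    return a + b


def adds_absolute_values(a, b):
    """Misconception described in the text: the sign of the sum is right,
    but its magnitude is always |a| + |b|, even when the signs disagree."""
    total = a + b
    sign = (total > 0) - (total < 0)
    return sign * (abs(a) + abs(b))


# Hypothetical latent states, each defined by the answers it predicts.
LATENT_STATES = {
    "mastery": correct_answer,
    "adds_absolute_values": adds_absolute_values,
}


def classify(responses, items):
    """Assign a subject to the state whose predicted answers match the
    observed responses on the most items (ties go to the first state)."""
    scores = {}
    for name, rule in LATENT_STATES.items():
        predicted = [rule(a, b) for a, b in items]
        scores[name] = sum(r == p for r, p in zip(responses, predicted))
    best = max(scores, key=scores.get)
    return best, scores


# Made-up items: two with like signs, three with unlike signs.
ITEMS = [(3, 5), (-3, -5), (-3, 5), (7, -2), (-6, 4)]

state, scores = classify([8, -8, 8, 9, -10], ITEMS)
```

A subject in the `adds_absolute_values` state answers every like-sign item
correctly and every unlike-sign item incorrectly, whatever the items' nominal
difficulty. That is a pattern which a position on a single continuum cannot
reproduce, and it is why a discrete state is the more natural description.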